Today, we will mainly focus on methods for analyzing and forecasting regular time-series data with seasonality patterns
By the end of this workshop, you probably won’t become an expert in time series analysis and forecasting, but you will be able to:
All of today’s slides, code, and R Markdown files are available on GitHub
Downloading the workshop material from the terminal:
git clone https://github.com/RamiKrispin/Time-Series-Workshop.git
install.packages(c("forecast", "plotly", "ggplot2", "dplyr", "UKgrid", "fpp2", "shiny", "tsibble", "dygraphs"))
devtools::install_github("RamiKrispin/TSstudio")
devtools::install_github("RamiKrispin/forecastML")
Time series analysis is commonly used in many fields of science, such as economics, finance, physics, engineering, and astronomy. The use of time series analysis to understand past events and to predict future ones did not start with the introduction of the stochastic process during the past century. Ancient civilizations such as the Greeks, Romans, and Mayans researched and learned how to utilize cyclical events such as weather and astronomical patterns to predict future events.
Time series analysis is the art of extracting meaningful insights from time-series data in order to learn about past events and to predict future ones.
This process includes the following steps:
Generally, in R this process will look like this:
Of course, there are more great packages that could be part of this process, such as zoo, xts, bsts, forecastHybrid, prophet, etc.
The TSstudio package provides a set of functions for time series analysis. That includes interactive data visualization tools based on the plotly package engine, supporting multiple time series objects such as ts, xts, and zoo. The following diagram demonstrates the workflow of the TSstudio package:
Time series data is a sequence of values, each associated with a unique point in time, that can be divided into the following two groups:
Note: typically, the term time series data refers to regular time-series data. Therefore, if not stated otherwise, throughout the workshop the term time series (or series) refers to regular time-series data
With time series analysis, you can answer questions such as:
There are multiple classes in R for time-series data; the most common are:

- ts class for regular time-series data, and mts class for multiple time series objects; this is the most common class for time series data
- xts and zoo classes for both regular and irregular time series data, mainly popular in the financial field
- tsibble class, a tidy format for time series data, supporting both regular and irregular time-series data

A typical time series object should have the following attributes:
Where the frequency of the series represents the units of the cycle. For example, for monthly series, the frequency units are the month of the year, and the cycle units are the years. Similarly, for daily series, the frequency units could be the day of the year, and the cycle units are also the years.
The stats package provides a set of functions for handling and extracting information from a ts object. The frequency and cycle functions, as their names imply, return the frequency and the cycle of the object, respectively. Let’s load the USgas series from the TSstudio package and apply those functions:
library(TSstudio)
data(USgas)
class(USgas)
## [1] "ts"
is.ts(USgas)
## [1] TRUE
frequency(USgas)
## [1] 12
cycle(USgas)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2000 1 2 3 4 5 6 7 8 9 10 11 12
## 2001 1 2 3 4 5 6 7 8 9 10 11 12
## 2002 1 2 3 4 5 6 7 8 9 10 11 12
## 2003 1 2 3 4 5 6 7 8 9 10 11 12
## 2004 1 2 3 4 5 6 7 8 9 10 11 12
## 2005 1 2 3 4 5 6 7 8 9 10 11 12
## 2006 1 2 3 4 5 6 7 8 9 10 11 12
## 2007 1 2 3 4 5 6 7 8 9 10 11 12
## 2008 1 2 3 4 5 6 7 8 9 10 11 12
## 2009 1 2 3 4 5 6 7 8 9 10 11 12
## 2010 1 2 3 4 5 6 7 8 9 10 11 12
## 2011 1 2 3 4 5 6 7 8 9 10 11 12
## 2012 1 2 3 4 5 6 7 8 9 10 11 12
## 2013 1 2 3 4 5 6 7 8 9 10 11 12
## 2014 1 2 3 4 5 6 7 8 9 10 11 12
## 2015 1 2 3 4 5 6 7 8 9 10 11 12
## 2016 1 2 3 4 5 6 7 8 9 10 11 12
## 2017 1 2 3 4 5 6 7 8 9 10 11 12
## 2018 1 2 3 4 5 6 7 8 9 10 11 12
## 2019 1 2 3 4 5 6 7
The time function returns the series index or timestamp:
head(time(USgas))
## [1] 2000.0000 2000.0833 2000.1667 2000.2500 2000.3333 2000.4167
The deltat function returns the length of the series’ time interval (which is equivalent to 1/frequency):
deltat(USgas)
## [1] 0.083333333
Similarly, the start and end functions return the starting and ending time of the series, respectively:
start(USgas)
## [1] 2000 1
end(USgas)
## [1] 2019 7
Here the left number represents the cycle units, and the right number represents the frequency units of the series. The tsp function returns the start and end of the series together with its frequency:
tsp(USgas)
## [1] 2000.0 2019.5 12.0
Last but not least, the ts_info function from the TSstudio package returns a concise summary of the series:
ts_info(USgas)
## The USgas series is a ts object with 1 variable and 235 observations
## Frequency: 12
## Start time: 2000 1
## End time: 2019 7
The ts function allows you to create a ts object from a single vector and an mts object from multiple vectors (or a matrix). By defining the start (or end) and frequency of the series, the function generates the object’s index. In the following example, we will load the US_indicators dataset from the TSstudio package and convert it to a ts object. US_indicators is a data.frame with the monthly vehicle sales and unemployment rate in the US since 1976:
data(US_indicators)
head(US_indicators)
## Date Vehicle Sales Unemployment Rate
## 1 1976-01-31 885.2 8.8
## 2 1976-02-29 994.7 8.7
## 3 1976-03-31 1243.6 8.1
## 4 1976-04-30 1191.2 7.4
## 5 1976-05-31 1203.2 6.8
## 6 1976-06-30 1254.7 8.0
mts_obj <- ts(data = US_indicators[, c("Vehicle Sales", "Unemployment Rate")],
start = c(1976, 1),
frequency = 12)
ts_info(mts_obj)
## The mts_obj series is a mts object with 2 variables and 524 observations
## Frequency: 12
## Start time: 1976 1
## End time: 2019 8
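Once an mts object exists, individual series can be pulled out and windowed with base R. The following is a minimal self-contained sketch using synthetic data rather than US_indicators (the series names "a" and "b" are made up for illustration):

```r
# Sketch with synthetic data: pull one variable out of an mts object
# and window a sub-period (series names "a" and "b" are made up)
m <- ts(matrix(rnorm(48), ncol = 2,
               dimnames = list(NULL, c("a", "b"))),
        start = c(2015, 1), frequency = 12)

a        <- m[, "a"]                        # a single ts object
a_recent <- window(a, start = c(2016, 1))   # keep 2016 onward
```

Subsetting an mts object by column name (as with the built-in EuStockMarkets dataset) keeps the time attributes intact, so the result is a regular ts object.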
| Series Type | Cycle Units | Frequency Units | Frequency | Example |
|---|---|---|---|---|
| Quarterly | Years | Quarter of the year | 4 | ts(x, start = c(2019, 2), frequency = 4) |
| Monthly | Years | Month of the year | 12 | ts(x, start = c(2019, 1), frequency = 12) |
| Weekly | Years | Week of the year | 52 | ts(x, start = c(2019, 13), frequency = 52) |
| Daily | Years | Day of the year | 365 | ts(x, start = c(2019, 290), frequency = 365) |
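As a quick check of the table above, here is a toy quarterly series built with made-up values, starting in the second quarter of 2019:

```r
# Toy quarterly series starting in 2019 Q2 (values are made up)
x   <- c(120, 135, 128, 150, 142, 160, 155, 170)
qtr <- ts(x, start = c(2019, 2), frequency = 4)

frequency(qtr)  # 4
cycle(qtr)      # quarter of the year for each observation
end(qtr)        # c(2021, 1): eight quarters after 2019 Q2
```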
What if you have more granular time series data, such as half-hour, 15-minute, or 5-minute intervals?
The ts object was designed to work with monthly, quarterly, or yearly series that have only two time components (e.g., year and month). Yet, more granular (high-frequency) series may have more than two time components. A common example is a daily series that has the following time attributes:
Moving to the hourly or minute level adds even more components, such as the hour, minute, etc.
The zoo and xts classes, and now the tsibble class, provide a solution to this issue.
“The tsibble package provides a data infrastructure for tidy temporal data with wrangling tools…”
In other words, the tsibble object allows you to work with a data-frame-like object (i.e., a tbl object) that has a time-awareness attribute. The key characteristics of this class:
tbl object - you can apply any of the standard tools for reformatting, cleaning, or modifying a tbl object, such as the dplyr functions
library(UKgrid)
data(UKgrid)
class(UKgrid)
## [1] "data.frame"
head(UKgrid)
## TIMESTAMP ND I014_ND TSD I014_TSD ENGLAND_WALES_DEMAND
## 1 2011-01-01 00:00:00 34606 34677 35648 35685 31058
## 2 2011-01-01 00:30:00 35092 35142 36089 36142 31460
## 3 2011-01-01 01:00:00 34725 34761 36256 36234 31109
## 4 2011-01-01 01:30:00 33649 33698 35628 35675 30174
## 5 2011-01-01 02:00:00 32644 32698 34752 34805 29253
## 6 2011-01-01 02:30:00 32092 32112 34134 34102 28688
## EMBEDDED_WIND_GENERATION EMBEDDED_WIND_CAPACITY
## 1 484 1730
## 2 520 1730
## 3 520 1730
## 4 512 1730
## 5 512 1730
## 6 464 1730
## EMBEDDED_SOLAR_GENERATION EMBEDDED_SOLAR_CAPACITY NON_BM_STOR
## 1 0 79 0
## 2 0 79 0
## 3 0 79 0
## 4 0 79 0
## 5 0 79 0
## 6 0 79 0
## PUMP_STORAGE_PUMPING I014_PUMP_STORAGE_PUMPING FRENCH_FLOW BRITNED_FLOW
## 1 60 67 1939 0
## 2 16 20 1939 0
## 3 549 558 1989 0
## 4 998 997 1991 0
## 5 1126 1127 1992 0
## 6 1061 1066 1992 0
## MOYLE_FLOW EAST_WEST_FLOW I014_FRENCH_FLOW I014_BRITNED_FLOW
## 1 -382 0 1922 0
## 2 -381 0 1922 0
## 3 -382 0 1974 0
## 4 -381 0 1975 0
## 5 -382 0 1975 0
## 6 -381 0 1975 0
## I014_MOYLE_FLOW I014_EAST_WEST_FLOW
## 1 -382 0
## 2 -381 0
## 3 -382 0
## 4 -381 0
## 5 -382 0
## 6 -381 0
library(dplyr)
library(tsibble)
data(UKgrid)
uk_grid <- UKgrid %>%
dplyr::select(time = TIMESTAMP,
net_demand = ND) %>%
tsibble::as_tsibble(index = time)
head(uk_grid)
## # A tsibble: 6 x 2 [30m] <UTC>
## time net_demand
## <dttm> <int>
## 1 2011-01-01 00:00:00 34606
## 2 2011-01-01 00:30:00 35092
## 3 2011-01-01 01:00:00 34725
## 4 2011-01-01 01:30:00 33649
## 5 2011-01-01 02:00:00 32644
## 6 2011-01-01 02:30:00 32092
class(uk_grid)
## [1] "tbl_ts" "tbl_df" "tbl" "data.frame"
index(uk_grid)
## time
tsibble::interval(uk_grid)
## 30m
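Because the index is time-aware, tsibble’s index_by function (an index-aware version of dplyr’s group_by) makes temporal aggregation straightforward. A hedged sketch, assuming uk_grid as created above and that the dplyr and tsibble packages are loaded:

```r
# Aggregate the half-hourly series to daily totals
# (assumes uk_grid from the previous chunk is available)
uk_daily <- uk_grid %>%
  tsibble::index_by(date = as.Date(time)) %>%
  dplyr::summarise(net_demand = sum(net_demand, na.rm = TRUE))
```

The result is a new tsibble indexed by date with a 1-day interval.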
As in most fields of statistics and machine learning, the goal of descriptive analysis is to reveal meaningful insights about the series using descriptive statistics and data visualization tools.
The plot and plot.ts functions are built-in R functions for plotting time series objects:
data("USVSales")
plot.ts(USVSales,
main = "US Monthly Total Vehicle Sales",
ylab = "Thousands of units",
xlab = "Date")
Alternatively, the ts_plot function from the TSstudio package provides an interactive visualization for time series objects (ts, xts, zoo, tsibble, etc.). This function uses the plotly package plotting engine:
ts_plot(USVSales,
title = "US Monthly Total Vehicle Sales",
Ytitle = "Thousands of units",
Xtitle = "Date",
slider = TRUE)
The main advantage of interactive data visualization tools is that they allow you to zoom in on the data with the click of a button. This is super useful when working with data in general and with time-series data in particular.
The dygraphs package is another great tool for visualizing time series data:
library(dygraphs)
dygraph(USVSales,
main = "US Monthly Total Vehicle Sales",
ylab = "Thousands of units",
xlab = "Date") %>%
dyRangeSelector()
Time series data typically has two types of patterns:
Structural patterns:
Nonstructural patterns
The irregular component - includes any other patterns that are not captured by the trend, cycle, and seasonal components, for example, structural changes, non-seasonal holiday effects, etc.
Together, the structural and non-structural patterns compose the series, which can be formalized by the following expressions:
\(Y_t = T_t + C_t + S_t + I_t\), when the series has an additive structure, or
\(Y_t = T_t \times C_t \times S_t \times I_t\), when the series has a multiplicative structure
Applying a log transformation to a multiplicative series transforms it into an additive structure:
\(log(Y_t) = log(T_t) + log(C_t) + log(S_t) + log(I_t)\)
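To see this in practice, we can use the built-in AirPassengers series (not part of the workshop data; it is used here only because it is a classic multiplicative series): its seasonal swings grow with the trend, and after a log transform the swings become roughly constant:

```r
# Yearly seasonal amplitude of AirPassengers, before and after log transform
yearly_range <- tapply(as.numeric(AirPassengers),
                       floor(time(AirPassengers)),
                       function(v) diff(range(v)))

yearly_range_log <- tapply(as.numeric(log(AirPassengers)),
                           floor(time(AirPassengers)),
                           function(v) diff(range(v)))

# On the raw scale, the amplitude grows year over year;
# on the log scale, it stays far more stable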
We typically either ignore the cycle or embed it with the trend component, therefore:
\[Y_t = T_t + S_t + I_t\]
The decompose function from the stats package decomposes a time series into seasonal, trend, and irregular components using moving averages. The ts_decompose function from the TSstudio package provides an interactive wrapper for the decompose function:
ts_decompose(USgas)
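The same kind of decomposition can also be inspected with base R’s decompose directly. A small sketch on the built-in AirPassengers series (a multiplicative series, hence type = "multiplicative"):

```r
# Classical decomposition by moving averages (stats::decompose)
dec <- decompose(AirPassengers, type = "multiplicative")
plot(dec)  # observed, trend, seasonal, and random components
```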
Seasonality is one of the most dominant components of time series data (when it exists), and it is derived from the frequency units of the series (e.g., the month of the year for monthly time series data)
Furthermore, as funny as it may sound, most of the seasonal patterns in nature are related to two astrophysical phenomena:
For instance, the seasonal patterns of natural phenomena such as weather (temperature, rain, and snowfall), sunrise and sunset times, or tide levels are dictated directly by the orbital period and the solar time of Earth.
Seasonal types:
Data visualization helps to identify seasonal patterns in the series:
library(TSstudio)
data(USgas)
USgas_df <- data.frame(year = floor(time(USgas)), month = cycle(USgas),
USgas = as.numeric(USgas))
# Setting the month abbreviation and transforming it to a factor
USgas_df$month <- factor(month.abb[USgas_df$month], levels = month.abb)
library(ggplot2)
ggplot(USgas_df, aes(x = USgas)) +
geom_density(aes(fill = month)) +
ggtitle("USgas - Kernel Density Estimates by Month") +
facet_grid(rows = vars(as.factor(month)))
Note that this plot is affected by the trend of the series; let’s detrend the series and replot it:
USgas_df <- data.frame(year = floor(time(USgas)), month = cycle(USgas),
USgas = as.numeric(USgas - decompose(USgas)$trend))
# Setting the month abbreviation and transforming it to a factor
USgas_df$month <- factor(month.abb[USgas_df$month], levels = month.abb)
library(ggplot2)
ggplot(USgas_df, aes(x = USgas)) +
geom_density(aes(fill = month)) +
ggtitle("USgas - Kernel Density Estimates by Month") +
facet_grid(rows = vars(as.factor(month)))
The ts_seasonal function is a customized function for seasonal plots of low-frequency time series data (e.g., daily, monthly, quarterly):
ts_seasonal(USgas, type = "normal")
ts_seasonal(USgas, type = "cycle")
ts_seasonal(USgas, type = "box")
When setting the type argument to all, the function combines the three plots. Let’s again detrend the series and look at its seasonal component:
USgas_decompose <- USgas - decompose(USgas)$trend
ts_seasonal(USgas_decompose, type = "all")
Similarly, we can use a heatmap, surface, or polar plots:
ts_heatmap(USgas, color = "Reds")
ts_heatmap(USVSales, color = "Reds")
ts_surface(USgas)
ts_polar(USgas)
library(UKgrid)
UKgrid_df <- extract_grid(type = "data.frame",
columns = "ND",
aggregate = "hourly",
na.rm = TRUE)
ts_quantile(UKgrid_df)
ts_quantile(UKgrid_df, period = "weekdays", n = 2)
ts_quantile(UKgrid_df, period = "monthly", n = 2)
Due to the continuous and chronologically ordered nature of time series data, there is a likelihood that there will be some degree of correlation between the series observations. For instance, the temperature in the next hour is not a random event since, in most cases, it has a strong relationship with the current temperature or the temperatures that have occurred during the past 24 hours.
In the context of forecasting and time series, we love correlated time series data!
The acf and pacf functions from the stats package plot the autocorrelation and partial autocorrelation of the series with its lags:
UKgrid_daily <- extract_grid(type = "ts", aggregate = "daily")
acf(UKgrid_daily)
pacf(UKgrid_daily)
The ts_cor function from the TSstudio package returns a more detailed plot of the series’ ACF and PACF:
ts_cor(UKgrid_daily, lag.max = 365 * 2)
ts_cor(UKgrid_daily, lag.max = 365 * 2, seasonal_lags = 7)
A more intuitive way to visualize correlation is the lag plot, which plots the series against its past lags:
ts_lags(USgas)
ts_lags(USgas, lags = c(12, 24, 36))
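The pattern a lag plot reveals can also be quantified directly. A small base-R sketch, again using the built-in AirPassengers series so the chunk stays self-contained:

```r
# Correlation of a monthly series with its seasonal (12-month) lag
x <- as.numeric(AirPassengers)
n <- length(x)
cor(x[-(1:12)], x[1:(n - 12)])  # strongly positive for a seasonal, trending series
```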
The primary use of the linear regression model is to quantify the relationship between the dependent variable Y (also known as the response variable) and the independent variable(s) X (also known as the predictors, drivers, or regressors) in a linear manner.
\[Y_{i} = \beta_{0} + \beta_{1}\times X_{1,i} + \epsilon_{i}\]
\[Y_{i} = \beta_{0} + \beta_{1}\times X_{1,i} + \beta_{2}\times X_{2,i} + ...+ \beta_{n}\times X_{n,i} + \epsilon_{i}\]
Big misconception
The term linear, in the context of regression, refers to the model coefficients, which must enter the model in a linear structure (as this allows us to construct a linear combination of the independent variables). The independent variables themselves, on the other hand, can enter the model in either linear or non-linear form.
The Ordinary Least Squares (OLS) method is a simple optimization method based on basic linear algebra and calculus, or matrix calculus (this section is for general knowledge; if you are not familiar with matrix calculus, you can skip it). The goal of OLS is to identify the coefficients that minimize the residual sum of squares. If the residual of the i-th observation is defined as:
\[\hat{\epsilon_{i}} = Y_{i} - \hat{Y_{i}}\]
Then we can set the cost function by the following expression:
\[\sum_{i=1}^N \hat{\epsilon}_{i}^2 = (Y_{1} - \hat{Y}_{1})^2 + (Y_{2} - \hat{Y}_{2})^2 + ... + (Y_{N} - \hat{Y}_{N})^2\]
Before applying the OLS method to minimize the residual sum of squares, for simplicity, we will rewrite the cost function in matrix form:
\(\mathbf{Y} = \left[\begin{array}{c}Y_{1} \\ Y_{2} \\ \vdots \\ Y_{N}\end{array}\right]\), \(\mathbf{X} = \left[\begin{array}{cccc}1 & X_{1,1} & \cdots & X_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{N,1} & \cdots & X_{N,n}\end{array}\right]\), \(\mathbf{\beta} = \left[\begin{array}{c}\beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{n}\end{array}\right]\), \(\mathbf{\epsilon} = Y - X\beta = \left[\begin{array}{c}\epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{N}\end{array}\right]\)
Here, these matrices represent the following:
Note: the residual term \(\hat{\epsilon}_{i}\) should not be confused with the error term \(\epsilon_{i}\). While the former represents the difference between the series Y and its estimate \(\hat{Y}\), the latter represents the difference between the series and its expected value.
Let’s set the cost function using the matrix form as we defined above: \[\sum {\epsilon^2} = {\epsilon}^T{\epsilon}\]
We will now start to expand this expression by using the formula of \(\epsilon\) as we outlined above:
\[{\epsilon}^T{\epsilon} = (Y-X\beta)^T(Y-X\beta)\]
Next, we will multiply the two components (\(\epsilon^T\) and \(\epsilon\)) and open the brackets:
\[\epsilon ^{T}\epsilon = Y^TY-2Y^TX\beta+\beta^TX^TX\beta\]
Since our goal is to find the \(\beta\) that minimizes this equation, we will differentiate the equation with respect to \(\beta\) and then set it to zero:
\[ \frac{\partial\epsilon ^{T}\epsilon}{\partial \beta}=\frac{\partial (Y^TY-2Y^TX\beta+\beta^TX^TX\beta)}{\partial\beta} = 0\]
Solving this equation will yield the following output:
\[X^TX\beta=X^TY\] Manipulating this equation allows us to extract \(\hat{\beta}\), the estimate of the coefficient matrix \(\beta\):
\[\hat\beta=(X^TX)^{-1}X^TY\]
Note that we changed the notation from \(\beta\) to \(\hat\beta\) in the final output, as it represents the estimate of the true value of \(\beta\).
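The closed-form solution above can be verified numerically. A self-contained sketch with simulated data (the true coefficients 2, 3, and -1.5 are made up for the simulation):

```r
# Estimate beta_hat = (X'X)^(-1) X'y by hand and compare with lm()
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                         # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # the OLS closed-form solution
beta_hat                                      # matches coef(lm(y ~ x1 + x2))
```

Solving the normal equations by hand reproduces, up to numerical precision, the coefficients returned by R’s lm function.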
The key properties of the OLS coefficient estimates:
We want to represent the different components of the series in a linear model formation:
\(Y_{t} = T_{t} + S_{t} + C_{t} + I_{t}\), when the series has an additive structure, or
\(Y_{t} = T_{t} \times S_{t} \times C_{t} \times I_{t}\), when the series has a multiplicative structure, where
Feature engineering is the name of the game here…